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I. INTRODUCTION 

Most reactions in cells are carried out by enzymes In many instances the rates of 
enzyme-catalyzed reactions are enhanced by million fold. A significantly large fraction of 
all known enzymes are proteins which are made from twenty naturally occurring amino 
acids. The amino acids are linked by peptide bonds to form polypeptide chains. The 
primary sequence of a protein specifies the linear order in which the amino acids are linked. 
To carry out the catalytic activity the linear sequence has to fold to a well defined three 
dimensional structure. In cells only a relatively small fraction of proteins require assistance 
from chaperones (helper proteins) [fj]. Even in the complicated cellular environment most 
proteins fold spontaneously upon synthesis. The determination of the three dimensional 
folded structure from the one dimensional primary sequence is the most popular protein 
folding problem. 

For a number of years the protein folding problem remained only of academic interest. 
The synthesis of proteins in cells was described by Crick Schematically this process can be 
represented as DNA — > RNA — > Proteins. This proposal and Anfinsen's || demonstration 
that a denatured protein would fold to the native conformation under suitable conditions 
was sufficient to understand the role of protein folding in cells. However, in the last couple 
of decades many diseases have been directly linked to protein folding (especially misfolding) 
Thus, there is an urgency to understand the mechanisms in the formation of folded 
structures. Biotechnology industry is also interested in the problem because of the hope that 
by understanding the way polypeptide chains fold one can design molecules (using natural 
or synthetic constituents) of use in medicine. Finally, the full potential of the human genome 
project involves understanding what the genes encode. For all these profoundly important 
reasons the protein folding problem has taken center stage in molecular biology. 

Because this problem is complex several venues of attack have been devised in the last 
fifteen years. A combination of experimental developments (protein engineering, advances in 
X-ray and NMR, various time resolved spectroscopies, single molecule manipulation meth- 
ods) and theoretical approaches (use of statistical mechanics, computational strategies, use 
of simple models) P-^j has lead to a greater understanding of how polypeptide chains reach 
the native conformation. 

From our perspective there are four major problems that comprise the protein folding 
enterprise. They are: 
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a) Prediction of the three dimensional fold of a protein given only the amino acid se- 
quence. This is referred to as the structure prediction problem. In Fig. (1) we show the three 
dimensional structure of myoglobin. The structure prediction problem involves determining 
this fold using the primary sequence which is given in the lower part of Fig. (1). 

b) Given that a sequence folds to a known native structure what are the mechanisms 
in the transition from the unfolded conformation to the folded state? This is the kinetics 
problem whose solution requires elucidation of the pathways and transition states in the 
folding process. 

c) How to design sequences that adopt a specified fold ||? This is the inverse protein 
folding problem that is vital to the biotechnology industry. 

d) There are some proteins that do not spontaneously reach the native conformation. In 
the cells these proteins fold with the assistance of helper molecules referred to as chaperonins. 
The chaperonin-mediated folding problem involves understanding the interactions between 
proteins. 

It is not the aim of this chapter to introduce the reader to all the areas listed above. Our 
goal is modest. We describe some of the theoretical developments which arose from studies 
of caricatures of proteins. Such models were designed to understand certain general features 
about protein structures and how these are kinetically reached. To keep the bibliography 
compact we mostly cite review articles. The interested reader can find the original papers 
in these cited works. We hope that this short introduction will entice the reader to delve 
into the ever surprising world of biological macromolecules. 



II. RANDOM HETEROPOLYMER AS A CARICATURE OF PROTEINS 

In homopolymers all the constituents (monomers) are identical, and hence the interac- 
tions between the monomers and between the monomers and the solvent have the same 
functional form. To describe the shapes of a homopolymer (in the limit of large molecular 
weight) it is sufficient to model the chain as a sequence of connected beads. Such a model 
can be used to describe the shapes that a chain can adopt in various solvent conditions. A 
measure of shape is the dimension of the chain as a function of the degree of polymerization, 
N. If N is large then the precise chemical details do not affect the way the size scales with 
iV ||. In such a description a homopolymer is characterized in terms of a single parameter 
that essentially characterizes the effective interaction between the beads which is obtained 
by integrating over the solvent coordinates. 

Proteins are clearly not homopolymers because many energy scales are required to char- 
acterize the polypeptide chain. Besides the excluded volume interactions and hydrogen 
bonds the potential between the side chain depend on the nature of the residues . There- 
fore, as a caricature of proteins the heteropolymer model is a better approximation. A 
convenient limit is the random heteropolymer for which approximate analytic treatments 
are possible |IJ . In a random heteropolymer the interactions between the beads are assumed 



to be randomly distributed. Some of the interactions are attractive (which is responsible 
for conferring globularity to the chain) while others are repulsive and these residues are 
better accommodated in an extended conformation. In proteins water is a good solvent for 
polar residues while it is poor solvent for hydrophobic residues. (In a good solvent contacts 
between the monomer and solvent are favored whereas in a poor solvent the monomers are 
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attracted to each other). Because only 55% of the residues in proteins are hydrophobic it 
is clear that in a typical protein energetic frustration plays a role. In addition because of 
chain connectivity there is also topological frustration. This arises because residues that 
are proximal tend to form structures on short length scales. Assembly of such short short 
length scale structures would typically be incompatible with the global fold giving rise to 
topological frustration. Even if energetic frustrations are eliminated a polypeptide chain (in 
fact biomolecules) are topologically frustrated [7j|. 

In the field of spin glasses and structural glasses such frustration effects are well known 
nfl. Thus, it was natural to suggest that random heteropolymers could serve as a simple 



representation of polypeptide chains. In Appendix A we sketch computational details for one 
model of random heteropolymers. Bryngelson and Wolynes proposed, using phenomenologi- 
cal arguments, that the Random Energy Model (REM) would be an appropriate description 
of some aspects of proteins |12|,|5|. The rationale for this is the following: Consider the 



exponentially large number of conformations. Because of the presence of several conflicting 
energies in a polypeptide chain it is natural to assume that these energies are randomly 
distributed. If there are no correlations between these energies and if the distribution is 
Gaussian one gets the REM. Of course, in the REM model the chain connectivity is ignored 
and there is no manifest for a spatial dependence of the chain coordinates. We show in 
Appendix B that in the compact phase the random heteropolymer is equivalent to REM. 

The random heteropolymer models of proteins are interesting from a statistical mechan- 
ical perspective. However, they do not explain the key characteristics of proteins such as 
reversible and cooperative folding to a unique native conformation. Moreover, the theories 
for heteropolymers suggest that typically the energy landscapes for these systems are ex- 
tremely rugged consisting of many minima that are separated by barriers of varying heights 
|TT| . This would mean that kinetically it would be impossible for chain with a typical real- 
ization of interactions to reach the ground state in finite time scales. Thus the dynamics of 
such random heteropolymer models typically exhibits glassy behavior. Natural proteins do 
not exhibit any hall mark of glassy dynamics at most temperatures of interest. It follows 
that a certain refinement of the random heteropolymers is required to capture protein-like 
properties. One of the important theoretical advances is the observation that very simple 
minimal models can be constructed that capture many (not all) of the salient features 
observed in proteins ||13|| . The simplest manifestation of such models are the lattice repre- 



sentation of polypeptide chains. In the next section we introduce the models and describe 
a few results that have been obtained by numerically exploring their behavior. 



III. LATTICE MODELS OF PROTEINS 



The computational protocol for describing protein folding mechanisms is straightforward 
in principle. The dynamics are well described by the classical equations of motion. Simula- 
tions of a monomeric protein involves equilibrating the polypeptide chain in a box of water 
molecules at the desired temperature and density. If an appropriately long trajectory is gen- 
erated then the dynamics of the protein can be directly monitored. There are two crucial 
limitations that prevent a straightforward application of this approach to study the folding 
of proteins. First, the interaction potentials or the force fields for such a complex system 
are not precisely known. Molecular dynamics simulations in the standard packages use po- 
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tentials that rely on the transferability hypothesis i.e., interactions designed in one context 
can be used in aqueous medium and for larger systems. The need to compute potentials 
that can be used reliably in the simulations of protein dynamics remains acute. 

The second problem is related to the limitations in generating really long time trajectories 
that can sample all the relevant conformational space of proteins. To observe reversible 
folding of even a moderate sized protein requires simulations that span the millisecond 
time scale. More importantly making comparisons with experiments involves generating 
many (greater than perhaps 100) folding trajectories so that a reliable ensemble average 
is obtained. Thus, we need to make progress on both (force fields and enhanced sampling 
techniques in long time simulations) fronts before a straightforward all atom simulations 
become routine. 

In light of the above mentioned difficulties various simplified models of proteins have 
been suggested ||13|| . The main rationale for using such drastic simplifications is that a 



detailed study of such models can enable us to decipher certain general principles that 
govern the folding of proteins |5|-0. For these class of models detailed computations without 
sacrificing accuracy is possible. Such an approach has yielded considerable insights into the 
mechanisms, time scales, and pathways in the folding of polypeptide chains. In this section 
we will outline some of the results that have been obtained (largely from our group) with 
the aid of simple lattice models of proteins. 

In the simple version of the lattice representation of proteins the polypeptide chain is 
modeled as a sequence of connected beads. The beads are confined to the sites of a suitable 
lattice. Host of the studies have used cubic lattice. To satisfy the excluded volume condition 
only one bead is allowed to occupy a lattice site. If all the beads are identical we have a 
homopolymer model whose characteristics on lattices have been extensively studied. To 
introduce protein-like character the interactions between beads (ones separated by at least 
three bonds) that are near neighbors on a lattice are assumed to depend on the nature of 
the beads. The energy of a conformation, specified by {r,, i — 1,2, iV}, is 

£({*}) = 5>(| r<- TV | -a)fl y (1) 

i<j 

where N is the number beads in the chain, a is the lattice spacing, and Bij is the value 
of the contact interaction between beads i and j. We will consider different forms of B^. 
Since this model can be viewed as a coarse grained representation of the a-carbons of the 
polypeptide chain the value of a is typically taken to be about 3.8A. 

Lattice models have been used for a long time in polymer physics |TJ]]. They were 



instrumental in computing many properties (scaling of the size of the polymer with N, 
distribution of end-to-end distance etc.) of real homopolymer chains. In the context of 



proteins lattice models were first introduced by Go and coworkers [15]. The currently popular 
Go model considers only interactions between residues (beads on the lattice) that occur in 
the native (ground) state. Thus, in this "strong specificity limit" only native contacts are 
taken into account. It follows that in this version of the Go model the chain is forced to 
adopt the lowest energy conformation at low temperatures. Go also considered a variant of 
this model in which certain non-native contacts are allowed. Although these models were 
insightful Go and coworkers did not use them to obtain plausible general principles of protein 
folding. This was partly due to the fact that in their studies they typically used long chains, 
and hence exact enumeration was not possible. 
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Simple lattice models, with the express purpose of obtaining minimal representations 



of polypeptide chains, were first suggested by Chan and Dill [|16| . In order to account for 
the major interactions in proteins these authors argued that the twenty naturally occurring 
amino acids can be roughly divided into categories, namely, hydrophobic (H) and polar (P). 
Chan and Dill have suggested that this simple HP model can capture many salient features 
of proteins. They also suggested that many of the conceptual puzzles (Levinthal paradox in 
particular) could be addressed by systematically studying short chains. This simple exactly 
enumerable HP model and their variants have been used to understand cooperativity, folding 



kinetics, and designability of protein structures Thus, it is instructive to describe the 



calculations that have been done using lattice models. A study of such models indeed 
provides a god introduction to the computational aspects of protein folding. 

Emergence of Structures: The sequence space of proteins is extremely dense. The number 
of possible protein sequences is 20^. It is clear that even by the fastest combinatorial 
procedure only a very small fraction of such sequences could have been synthesized. Of 
course, not all of these sequences will encode protein structures which for functional purposes 
are constrained to have certain characteristics. A natural question that arises is how do 
viable protein structures emerge from the vast sea of sequence space? The two physical 
features of folded proteins are : (1) In general native proteins are compact but not maximally 
so. (2) The dense interior of proteins are largely made up of hydrophobic residues and the 
hydrophilic residues are better accommodated on the surface. These characteristics give the 
folded proteins a lower free energy in comparison to all other conformations. 

Lattice models are particularly suited for answering the question posed above. We will 
show that the two physical restrictions are sufficient to rationalize the emergence of very 
limited (believed to be only on the order of a thousand or so) protein-like structures. To 
provide a plausible answer to this question using lattice models we need to specify the form 
of the interaction matrix elements Bij. For purposes of illustration we consider the random 
bond (RB) model in which the elements are distributed as 

P(i) s )^ H p[-^|. (2) 



Here cr(= 1) is the variance in B^ and the hydrophobicity parameter Bq is the mean value. 
We chose B = (half the beads are hydrophobic) and B = —0.1. The latter is motivated 
by the observation that in natural proteins roughly 55% of the residues are hydrophobic 
0- 

Protein-like structures are not only compact but also have low energy. With this in mind 
we have calculated the number of compact structures (CSs) as CSs with low energy for a 
given N. The number of CSs in its most general form may be written as 

C N {CS) ~ Z N Z? ^N^" 1 (3) 

where InZ is the conformational free energy (in units of &bT), Z\ is the surface fugacity, 
d is the spatial dimension, and 7 C is measures possible logarithmic corrections to the free 
energy. It is clear that natural proteins are relatively unique and hence their number on an 
average has to grow at rates that are much smaller than that given in Eq. (21). To explore 
this we have calculated by exact enumeration the number of compact structures, Cn(CS) 
and the number of minimum energy compact structures Cn(MES) as a function of N. 



5 



We performed exhaustive enumeration of all self-avoiding conformations, to explore the 
conformational space of the polypeptide chain of a given length. In order to reduce the 
sixfold symmetry on the cubic lattice we fixed the direction of the first monomeric bond in 
all conformations. The remaining conformations are related by eightfold symmetry on the 
cubic lattice (excluding the cases when conformations are completely confine to a plane or 
straight line). To decrease further the number of conformations to be analyzed the Martin 
algorithm [p| was modified to reject all conformations related by symmetry. 

We define MES as those conformations, whose energies lie within the energy interval 
A above above the lowest energy Eq. Several values for A were used to ensure that no 
qualitative changes in the results are observed. We set A to be constant and equal to 1.2 
(or 0.6) (definition (i)). We have also tested another definition for A, according to which 
A = 1.3|J5o — tBo\/N, where t is the number of nearest neighbor contacts in the ground 
state (definition (ii)). It is worth noting that in the latter case A increases with N. Both 
definitions yield equivalent results. Using these definitions for A, we computed C(MES) as 
a function of the number of residues N. 

The computational technique involves exhaustive enumeration of all self-avoiding confor- 
mations for iV < 15 on cubic lattice. In doing so we calculated the energies of all conforma- 
tions according to Eq. (19), and then determined the number of MES. Each quantity, such as 
the number of MES, C(MES), the lowest energy Eq, the number of nearest-neighbor contacts 
t in the lowest energy structures, is averaged over 30 sequences. Therefore, when referring 
to these quantities, we will imply their average values. To test the reliability of the compu- 
tational results an additional sample of 30 random sequences was generated. Note that in 
the case of C(MES) we computed the quenched averages, i.e., C(MES) = exp[in[c(MES)]], 
where c is the number of MES for one sequence. 

The number of MES C(MES) is plotted as a function of the number of residues N in Fig. 
(2) for Bq = —0.1 and A = 0.6. A pair of squares at given N represents C(MES) computed 
for two independent runs of 30 sequences each. For comparison, the number of self-avoiding 
walks C(SAW) and the number of CS C(CS) are also plotted in this figure (diamonds and 
triangles, respectively). The most striking and important result of this graph is the following: 
As expected on general theoretical grounds, C(SAW) and C(CS) grow exponentially with N, 
whereas the number of MES C(MES) exhibits drastically different scaling behavior. There is 
no variation in C(MES) and its value remains practically constant within the entire interval 
of N starting with N = 7. We find (see Fig. (2)) that C(MES) « 10 1 . This result further 
validates our earlier finding for two dimensional model. These results suggest that C(MES) 
grows (in all likelihood) only as InN with N. Thus the restriction of compactness and low 
energy of the native states may force an upper bound on the number of distinct protein 
folds. 

3D HP model: The calculations described above suggest that upon imposing minimal restric- 
tions on the structures (compactness and low energies) the structure space becomes sparse. 
As suggested before this must imply that each basin of attraction (corresponding to a given 
MES) in the structure space must contain numerous sequences. The way these sequences 
are distributed among the very slowly growing number (with respect to N) of MES, i.e., 
the density of sequences in structure space, is an important question. This was beautifully 



addressed in the paper by Li et. al []19 |. They considered a three dimensional (N = 27) 



cubic lattice. By using HP model and restricting themselves to only maximally compact 
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structures as putative native basins of attractions (NBA) they showed certain basins have 
much larger number of sequences. In particular, they discovered that one of the NBAs 
serves as a ground state for 3794 (total number is 2 27 ) sequences and hence was considered 
most designable (Fig. (3)). The precise density of sequences among the NBAs is clearly 
a function of the interaction scheme. These calculations and the arguments presented in 
the previous subsection using the random bond model point out that since the number of 
NBA for the entire sequence space is small it is likely that proteins could have evolved ran- 
domly. Naturally occurring folds must correspond to one of the basins of attraction in the 
structure space so that many sequences have these folds as the native conformations, i.e., 
these are highly designable structures in the language of LWC [fOJ. These ideas have been 
further substantiated by Lindgard and Bohr [20|, who showed that among maximally com- 
pact structures there are only very few folds that have protein-like characteristics. These 
authors also estimated using geometrical characteristics and stability arguments that the 
number of distinct folds is on the order of a thousand. All of these studies confirm that the 
density of the structure space is sparse. Thus, each fold can be designed by many sequences. 
From the purely structural point of view nature does have several options in the sense that 
many sequences can be "candidate proteins". However, there is also evolutionary pressure 
to fold rapidly (i.e., a kinetic component to folding). This requirement further restricts the 
possible sequences that can be considered as proteins, because they must satisfy the dual 
criterion of reaching a definite fold on a biologically relevant time scale. These observations 
are schematically sketched in Fig. (4). 

Symmetry and designability: In the study by Li. et. al. |1| it was noted that highly 
designable structures appear to be symmetric. Independently, in a thought provoking ar- 
ticle Wolynes |21] has made a series of compelling arguments as to why nature might use 
symmetry (at least in an inexact manner) to generate symmetrical tertiary folds of proteins. 
Many enzymes are oligomers. Wolynes makes a number of observations about the symmetry 
aspects of protein structures: (a) The terameric hemoglobin (a-helical protein) molecule has 
an approximate two-fold symmetry, (b) A striking example of approximate symmetry in j3 
proteins is found in the the structure of a monomeric 7 crystallin in which the shapes adopted 
by residues 1-88 and 89-174 are nearly the same. However, the two individual sequences do 
not bear much similarity. This, of course, is consistent with the notion that the structure 
space is so sparse that many sequences are forced to adopt similar shapes. The interesting 
conclusion from examining the 7 crystallin structure is that the underlying symmetries in 
the shape are only inexact, (c) The obvious example of very nearly symmetrical structures 
are in helical proteins with the the four helix bundle being one the most prominent exam- 
ples (see Fig. (5)). (d) Various proteins with mixed topology (like TIM barrels and jelly 
rolls) appear to have the kind of inexact but apparent symmetrical arrangement discussed 
by Wolynes ||22|| . (e) He also conjectured that it is likely that the underlying approximate 



symmetry is reflected in the free energy landscape being funnel-like. This would facilitate 
rapid folding which for many proteins may be a result of evolutionary pressure. The precise 
connections between the symmetries and the folding mechanisms and functional competence 
of biological molecules have not been worked out. Nevertheless it appears that employing 
such ideas might be useful in de novo design of proteins. 

We note that equally striking are the kind of symmetrical arrangements found in RNA 
molecules PBfl. The crystal structure of the P4-P6 domain of Tetrahymena self-splicing RNA 
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clearly is highly symmetric with helices packed in a nearly regular arrangement. Since in an 
evolutionary sense the RNA world might have preceded the protein world it is interesting 
to speculate that the emergence of inexact design may have been a biological necessity. The 
observation of inexact symmetries in a protein structure might be a consequence of the fact 
that they are present in the "parent" molecules. In fact this evolutionary conservation may 
have been imprinted when evolution from RNA world to the current scheme for protein 
synthesis took place. The most compelling reason for observing near regular patterns in 
biomolecular structures is because synthesis of symmetrical folds might be energetically 
economical. 

Exploring protein folding mechanism using lattice model: It is well known that proteins 
reach the biologically active native states in a relatively short time which is on the order of 
a second for most single domain proteins UJ. Based on folding and refolding experiments on 
ribonuclease A Anfinsen concluded that under appropriate conditions natural sequences of 
proteins spontaneously fold to their native conformation |3[]. This implies that protein folding 
is self-assembly process i.e., the information needed for specifying the topology of the native 
state is contained in the primary sequence itself. This thermodynamic hypothesis does not, 
however, address the question of how the native state is accessed in a short time scale. This 
issue was raised by Levinthal who wondered how a polypeptide chain of reasonable length 
can navigate the astronomically large conformational space so rapidly. Levinthal posited 
that certain preferred pathways must guide the chain to the native state. The Levinthal 
paradox, simplistic as it is, has served as an intellectual impetus to understand the ease 
with which a polypeptide chain reaches the native conformation ||||. We use lattice models 
to describe the foldability of biological sequences of proteins. A sequence is foldable if it 
reaches the native state in a reasonable time and remains stable over some range of external 
conditions (pH, temperature). 



Characteristic Temperatures: The basic features of folding can be understood in terms of 
two fundamental equilibrium temperatures that determine the "phases" of the system [|7|]. 
At sufficiently high temperatures (kT greater than all the attractive interactions) the shape 
of the polypeptide chain can be described as a random coil and hence its behavior is the same 
as a self-avoiding walk. As the temperature is lowered one expects a transition at T = Tg 
to a compact phase. This transition is very much in the spirit of the collapse transition 
familiar in the theory of homopolymers ||. The number of compact conformations at Tg 
is still exponentially large. Because the polypeptide chains have additional energy scales 
that discriminate between the various compact conformations we expect a transition to the 
ground (native) state at a lower temperature Tp. Generally the transition at Tg is second 
order, while the transition at Tp is first order like. Such transitions here are for very small 
systems and the notion of "phases" should be used with care. These expectations, based 
on fairly general arguments, have been confirmed in various lattice simulations of protein- 
like heteropolymers. For the lattice models the collapse temperature Tg is determined from 
the peak of the specific heat and the folding transition temperature is obtained from the 
fluctuations in the overlap function given by 
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Ax =< x 2 > ~ < X > 2 



(4) 



where 

with rf- referring to the native state. In Fig. (7a) (for the structure displayed in Fig. (6)) 
we plot the temperature dependence of C v , which has a peak at Tg = 0.83. This figure also 
shows the variation of d < R g > jdT with temperature. The peak of this curve (0.86) 
almost coincides with that of the specific heat indicating that this transition is associated 
with compaction of the chain. Hence, the maximum in C v legitimately indicates the collapse 
temperature. X-ray scattering experiments have been used to obtain Tg for a few proteins. 
In Fig. (7b) we show the temperature dependence of Ax from which the folding temperature 
T F is determined to be 0.79. 



Folding Rates: The key question we want to answer is what are the intrinsic sequence 
dependent factors that not only determine the folding rates but also the stability of the 
native state? It turns out that many of the global aspects of folding kinetics of proteins can 
be understood in terms of the equilibrium transition temperatures. In particular, we will 
show that the key factor that governs the foldability of sequences is the single parameter 

°t = — Y e — ( 6 ) 

which indicates how far Tp is from Tg. To establish a direct correlation between the folding 
time Tp and ap we generated a number of sequences for N = 27. The folding time was taken 
to be equal to the mean first passage time. The first passage for a given initial trajectory was 
calculated by determining the total number of Monte Carlo steps (MCS) needed to reach the 
native conformation for the first time. By averaging over an ensemble of initial trajectories 
(typically this number varies between 400 - 800 in our examples) the mean first passage time 
is obtained. The precise moves that are utilized in the simulations are described elsewhere 
|17| . The dependence of Tp on <jp (for the random bond model and for the other interaction 
schemes) is given in Fig. (8). This figure shows a remarkable correlation between the folding 
time and op. A small change in op results in a dramatic effect (a few orders of magnitude) 
on the folding times. It is clear that both Tg and Tp are dependent on the sequence. As a 
result mutations that preserve the native state can alter the folding rates due to the change 
in the ap values. 

Using lattice models we have also established that folding rates correlate well with Z = 
(E N —E M s/5, where E N is the native state energy, E MS is the average energy of the ensemble 
of misfolded structures, and 5 is the dispersion in the contact energies. The relationship 
between a and Z also suggests that, in general, the correlation between Tp and a should 
be superior. More importantly, experimental measurements of Z are difficult. On the other 
hand, both Tg amd Tp can be measured in scattering, CD, or fluorescence experiments. 
Other measures, such as energy gap (however, it is defined), do not correlate with Tp. 

In the previous section we showed that because the structure space is very sparse there 
have to be many sequences that map onto the countable number of basins in the structure 
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space. The kinetics here shows that not all the sequences, even for highly designable struc- 
tures, are kinetically competent. Consequently, the biological requirements of stability and 
speed of folding severely restrict the number of evolved sequences for a given fold. This very 
important result is schematically shown in Fig. (4). 

It is important to point out that the simulations reported in Fig. (8) were done at 
sequence dependent temperatures using the condition < x(T s ) >= 0.21. At these temper- 
atures, all of which are below their respective folding transition temperatures, the native 
conformation has the highest occupation probability. In lattice models the native state is a 
single conformation (a microstate) which is, of course, physically unrealistic. In real systems 
there is a volume associated with the native basin of attraction (NBA) and there are many 
conformations that map onto the NBA. The probability of being in the NBA at the various 
simulation temperatures is in excess of 0.5 so that under the conditions of our simulations 
the stability criterion is automatically satisfied. The results in Fig. (8), therefore, shows 
that the dual requirement of stability and the kinetic accessibility of NBA is most easily 
satisfied by those sequences that have small values of a T . Thus rapid folding occurs when 
ot ~ i.e., near a multicritical-like point. In this case there are no detectable intermediate 
"phases". The sequence, whose native state is shown in Fig. (6) has o~t = 0.05. We found 
that this sequence folds rapidly. 

Topological Frustration and Kinetic Partitioning Mechanism: Lattice models can 
also be used to obtain the outlines of the mechanisms for folding of proteins. The qualitative 
aspects of the folding kinetics of biomolecules can be understood in terms of the concept 
of topological frustration. The primary sequence of proteins has about 55% hydrophobic 
residues. The linear density of hydrophobic residues along the polypeptide chain is roughly 
constant implying that the hydrophobic residues are spread throughout the chain. As a 
result on any length scale / there is a propensity for the hydrophobic residues to form tertiary 
contacts under folding conditions. The resulting structures which are formed by contacts 
between residues that are in proximity would be in conflict with the global fold corresponding 
to the native state. The incompatibility of structures on local scales with the near unique 
native state on the global scale leads to topological frustration. It is important to realize 
that topological frustration is inherent to all foldable sequences, and is a direct consequence 
of the polymeric nature of proteins as well as competing interactions (hydrophobic residues 
which prefer the formation of compact structures and hydrophilic residues which are better 
accommodated by extended conformations). A consequence of topological frustration is that 
the underlying energy landscape is rugged consisting of many minima that are separated by 
barriers of varying heights. 

It is important to understand the nature of the low-lying minima in the rugged energy 
landscape. On the length scale / there are many ways of forming structures that are in 
conflict with the global fold. It is expected that most of these structures have high free 
energies and are unstable to thermal fluctuations. We expect a certain number of these 
structures to have low free energies and be relatively deep minima. The overlap between 
these structures and the native fold could be considerable and hence these structures could 
be viewed as being native-like. These competing basins of attraction (CBA) in which the 
polypeptide chain adopts native-like structures can act as kinetic traps that will slow down 
the folding process. 
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The basic consequences of topological frustration for mechanisms of folding can be un- 
derstood in terms of the kinetic partitioning mechanism (KPM) . Imagine an ensemble of 
denatured molecules in search of the native conformation. This is the experimental situation 
that arises when the concentration of denatured molecules is decreased. It is clear that a 
fraction of molecules $ would reach the NBA rapidly without being trapped in the low lying 
energy minima. The remaining fraction would be trapped in the minima and only on longer 
time scales fluctuations enables the chain to reach the NBA. The value of the partition factor 
$ depends on the sequence and is explicitly determined by the a? value. Thus because of 
topological frustration the initial pool of denatured molecules partitions into fast folders and 
slow folders that reach the native state by indirect off-pathway processes. 

From the description of the kinetic partitioning mechanism (KPM) given above it follows 
that generically the time dependence of the fraction of molecules that have not folded at 
time t, P u (t), is given by 

P u (t) = $exp( — )+^a fc exp(-— ) (7) 

T NCNC k T k 

where tncnc is the time constant for reaching the native state by the fast process, r k 
is the time for escape from the CBA labeled k, and a k is the "volume" associated with 
the k th CBA. From this consideration we expect that for a given sequence trajectories can 
be grouped into those that reach the native conformation rapidly ($ being their fraction) 
and those that remain in one of the CBA for discernible length of time. In Fig. (9a) we 
show an example of trajectory that reaches the native state directly from the random coil 
conformation. In contrast in Fig. (9b) we show an example of a trajectory for the same 
sequence at the same simulation temperature. This figure shows that on a very short time 
scale the chain gets trapped in conformations other than the NBA and only on long time 
scale it reaches the native state. This figure illustrates the basic principle of KPM. If we 
perform an average over an ensemble of such trajectories the kinetic result given in Eq. (|7|) 
ensues. 



Classifying Folding Mechanisms in terms of er^: The various folding mechanisms 
expected in foldable sequences may be classified in terms of the ot- We have already shown 
that sequences that fold extremely rapidly have very small values of <jt- Based on our study 
of several model proteins as well as analysis of real proteins we classify the folding kinetics 
of proteins as follows |7|]: 



Fast Folders: For these sequences the value of <jt is less than a certain small value o\. For 
such sequences the folding occurs directly from the ensemble of unfolded states to the NBA. 
The free energy surface is dominated by the NBA (or a funnel) and the volume associated 
with NBA is very large. The partition factor <3> is near unity so that these sequences reach 
the native state by two-state kinetics. The amplitudes a k in Eq. (0) are nearly zero. There 
are no intermediates in the pathways from the denatured state to the native state. Fast 
folders reach the native state by a nucleation-collapse mechanism which means that once 
a certain number of contacts (folding nuclei) are formed then the native state is reached 
very rapidly P3 , |23| . The time scale for reaching the native state for fast folders (which are 
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normally associated with those sequences for which topological frustration is minimal) is 
found to be 

TNCNC = —f{° T )N" (8) 

7 

where r\ is the solvent viscosity, a is the typical size of a residue, 7 is the average surface 
tension between the residue and water, /(cry) is typically an exponential function of o^, and 
the exponent uj is between 3.8 and 4.2. In general, only small proteins (N less than about 
100) are fast folders. 



Moderate Folders: Sequences for which 07 < < ah can be classified as moderate fold- 
ers. Unlike fast folding sequences the $ values are fractional which means that a substantial 
fraction of molecules is essentially trapped in one of the CBAs for some length of time. 
For these sequences there are detectable intermediates and for all but very small proteins 
the rate determining step is the activated transition from one of the CBAs to the native 
state. The average time scale for transition from these misfolded structures to the native 
conformation is given by 

r F ^r exp(v / iV) (9) 

at T pdTp- This shows that typical barriers for moderate folders are quite small. As a result 
the folding times even for long proteins (N ~ 200) are only on the order of a second. It is 
these small barriers that enable typical proteins to fold in biologically relevant time scale 
without encountering the Levinthal paradox. 



Slow Folders and Chaperones: For sequences with oj- > ah folding is extremely slow and 
these sequences may not reach the native state in a biologically relevant time scales. The 
volume corresponding to NBA is very small in this case and as a result $ is nearly zero. The 
free energy surface is dominated by CBAs. Under these circumstances spontaneous folding 
does not become viable. In cells such proteins are rescued by chaperones. Typically this 
happens when N is so large that r exp(\/N) exceeds reasonable folding time scales. Thus 
in cells we expect that only those proteins which are large or whose biological functioning 
state has to be oligomers require chaperones. 

Minimum number of residues for obtaining foldable protein structures: Natural proteins 
are made up of twenty amino acid residues. An important question, from the perspective of 
protein design, is how many distinct types of residues are required for protein-like behavior? 
Such a selection cannot be made arbitrarily because in natural proteins one should have 
polar, hydrophobic, and charged residues. In addition, for optimal packing of the core 
hydrophobic residues with different van der Waals radii may be required. To explore the 
potential simplification of the number of residues Wang and Wang |M] have carried out 
a highly significant study using lattice models and standard statistical potentials for the 
contact interaction elements By (Eq. (19)). They discovered that a grouping of amino acid 
residues into five categories mimics the folding behavior found using the standard twenty 
residues. To demonstrate this they used a cubic lattice with iV = 27, and mostly focussed 
on the maximally compact structures as ground states. Thus, structures such as ones given 
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in Fig. (6) are not explicitly considered. Nevertheless, the demonstration that a suitable 
set of five amino acid residue types is sufficient is an important result which should have 
implication for protein design problem - the generation of primary sequences that can fold 
to a chosen target folded structure. 

In their original article they mostly focussed on various thermodynamic properties (na- 
ture and degeneracy of the ground states). They have also carried out kinetic simulations 
to assess if the kinetic properties are altered by using a reduced number of residues. To 
test this idea Wang and Wang used the foldability index o (which correlates well with fold 
rates) as a discriminator of sequence properties. The precise question addressed by Wang 
and Wang is the following: What is the minimum number of residues that are required to 
obtain foldable (characterized by having relatively small values of a) sequences? We find 
that fast folding sequences have a less than about a quarter. They carried out two sets of 
computations. In one set they initially optimized the stability gap || of various sequences 
using the twenty residues. They substituted the residues in these optimized sequences by the 
representative residue for each group. Four subgroups were considered with each containing 
five and a variant, three and two amino acids. The foldability index for the standard sample 
and their substitutes is shown in Fig. (10) as solid circle. In another set of computations 
they examined the foldability index (open diamonds in Fig. (10)) for sequences that were 
optimized using the reduced sets of amino acids. Both these curves show that as long as 
the number of amino acid types exceed five one can generate sequences with relatively small 
values of a. Fig. (10) also shows that smaller values of a can be obtained if optimization 
is carried out with reduced sets of amino acids. Such sequences are foldable i.e, the dual 
requirements of stability over a wide temperature range and the kinetic accessibility of their 
native states are simultaneously satisfied. 



IV. CONCLUSIONS 

The examples of modeling discussed in Sections (II) and (III) are meant to illustrate 
the ideas behind theoretical and computational approaches to protein folding. It should be 
borne in mind that we have discussed only a very limited aspect of the rich field of protein 
folding. The computations described in Section (III) can be carried out easily on a desktop 
computer. Such an exercise is, perhaps, the best of way of appreciating the simple approach 
to get at the principles that govern the folding of proteins. 

In this chapter we have not discussed experimental advances that are offering extraordi- 
nary insights into the way the denatured molecules reach the native state. Two remarkable 
experimental approaches hold the promise that in short order we will be able to watch the 
folding process from submicrosecond time scale till the native state is reached. A brief 
summary of these follow. 

1) Eaton and coworkers showed in 1993 that optical triggers of folding can offer a window 
into the folding process from microsecond time scale p7| |. Since this pioneering work many 



laboratories have probed the plausible structure formation that occur on short time scale. 
Fast folding experimental techniques have been used to obtain detailed kinetics for the 
building blocks of proteins, namely, /3-hairpin, a-helices, and loops. Very recent experiments 
have given compelling evidence that there are populated native-like intermediates even in 
proteins that were thought to follow two-state kinetics. 
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2) Perhaps the most exciting development in the last few years is the ability to nanoma- 
nipulate single biomolecules using atomic force microscopy and optical tweezer techniques 
PSfl . So far such experiments have been used to provide a microscopic basis of elasticity in 



muscle proteins. If these stretching experiments can be combined with fluorescent resonance 
energy transfer experiments then it is possible to follow the folding of individual molecules 
as it passes through the transition state to the native conformation. It has been suggested 
on theoretical grounds that such two-dimensional single molecule experiments can measure 
directly the distribution of folding rates (and the barrier distribution) in much the same way 
mean first passage times are computed in minimal protein models (see Section (III)). 

The challenges posed by these high precision experiments demand more refined models 
and further developments in computational techniques. For the theoretically inclined it will 
no longer be sufficient to describe kinetics only in terms of energy landscapes. The wealth of 
data that are being generated by experiments, such as the ones mentioned above, requires 
quantitative understanding of the various factors that govern the pathways, mechanisms, 
and the transition states in the folding process. These challenging issues will make the area 
of biomolecular folding an engaging one for many years to come. 
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APPENDIX A: 

There are several versions of the random heteropolymer models. To keep the discussions 
technically simple we will consider one case - the so called random hydrophilic-hydrophobic 
chain whose phases were studied by Garel, Orland and Leibler (GLO) |]I0"| . The GLO 
model consists of a polymer chain with N monomers. The GLO model can be viewed as a 
generalization of the popular Edwards model which was introduced to understand swelling 
of real homopolymer chains in good solvents 0. In the GLO model the chain is made up 
of hydrophobic (hydrophilic) residues that tend to collapse (swell) the chain when dispersed 
in a solvent. The solvent mediated interactions at each site is assumed to be random. The 
random interactions depend only on a given site i and the strength depends on the degree 
of hydrophilicity Aj. Besides the term accounting for chain connectivity there are two and 
higher body interactions that determine the shape of the chain. In the GLO model the two 
body interaction is given by 

Vij = v + (3(Xi + Xj)6[ri - r 5 ] (Al) 

where vq is repulsive short range interaction, A, is a quenched random variable which is 
distributed as 

^Vfb^-Tl' (A2, 
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If the mean Ao is positive then the majority of the residues are hydrophilic. Description of 
the collapsed phase of the chain requires introducing three and and four-body interaction 
terms. Thus, the total Hamiltonian is 

l 3H =\Y, v ii + \ E ^(ri - rj)5(ri - r k ) + — ^ ^S{n - rj)6(rj - r k )S(r k - n). 

(A3) 

Since the charge variables A, are quenched the thermodynamics of the system requires av- 
eraging the free energy using the distribution P(Aj) i.e., 

F = -k B T J [] P(\) In ZiXMX,}. (A4) 

The average of lnZ(Aj) is most conveniently done using the replicas through the relation 

lim Z n — 1 

lnZ=n^0 . (A5) 

n 

Using Eqs. (2-4) the required average can be carried out. This leads to a complicated 
expression for Z n where the bar indicates the average over the quenched random variables 
Aj. In terms of the order parameters 

q ab (r, r') = J ds5(r a (s) - r)5(r b (s) - r') (A6) 

and 

Pair-) = J ds5(r a (s) - r) (A7) 
(a and b are replica indices) the expression for Z n becomes 

Z* = j Dq ab (r, r')Dq ab (r, r')Dp a (r)D<f> a (r) exp[H eff ] (A8) 



where 



with 



H eff = G(q ab , q ab , p a , (p a ) + \n((q ab , <j> a ) (A9) 

G = J dr^Ma - (v + 2/3A )| - ^ _ ^ 

+ J dr j dr'J2(iq ab (r,r')q ab (r,r')+f3 2 \ 2 qab(r,r')p a (r)p b (r')) 



a<b 



and 



((qabAa) = f YlDr a {s)exp(-H T {r a (s)}). (A10) 

J a 

In Eq. (10) H T {r a (s)} is 
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d f N dr f N f N 

H T {r a (s)} = exp[- — / f - i ds^(j) a {r a {s)) - i I ds^q ab (r a ,r b )}. 

2a 2 Jo a ds Jo " Jo a<b 

(All) 

and 

u' 3 = w 3 - 3(3 2 X 2 . (A12) 

The path integrals in Eq. (10) may be evaluated using the spectrum of the effective n-body 
Hamiltonian 

#n = -^E V '+E #W + E i{ lab(r a , T b ) (A13) 
ZCl a a a <b 

in the limit of n —>■ 0. If iV is very large then we can use ground state dominance to evaluate 
the spectrum of H n . This gives 

4>a) ^ exp[-iV min {< V \ H n \ * > -E {< * | * > _l}] (A14) 
{*(0> 

where £? is the ground state energy of H n . GLO evaluated the integral over q ab (Eq. (6)) 
by a saddle point approximation which leads to 

iq a b(r,r') = -f3 2 \ 2 p a (r)p b (r'). (A15) 

From the above equation it follows that in the mean-field limit replica symmetry is not 
broken. This makes the GLO model conceptually simpler to interpret than the random 
bond heteropolymer model discussed in the Appendix. 

The total wavefunction \l/{ri,r 2 , r n } (Eq. (12)) is written as a product of single 

particle functions (Hartree approximation). The various integrals are evaluated in the saddle 
point approximation. A simple Gaussian form for the trial one particle wavefunction 

1 , r 2 



is chosen with R being the single variational parameter. Upon performing the Gaussian 
integrals the free energy per monomer / becomes 

of a2 | 1 (^o + 2/3A ) iV 

^-%tv + (2^y 2 R* + il (A17) 

where 

a - - ^- in - ^4> 2 + < pW^I^ 3 < ais > 

At low temperatures the shape of the chain is determined by the sign of first term in Eq. 
(18). If the sign is negative then the positive four body term is required for a stable theory. 

The phase of the random hydrophobic-hydrophilic model is complicated and depends on 
the value of Aq |10 |- We only describe the hydrophilic case when Aq is positive. In this case 
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there is a first order transition to a collapsed state (R ~ iV _1//d ) induced by the negative 
three body term. GLO pointed out that this transition is neither the usual ^-point nor 
is it a freezing temperature because there is no replica symmetry breaking. In fact, this 
collapse transition resembles that seen in proteins where it is suspected that it is first order 
transition. The microscopic origin of the first order transition upon collapse of polypeptide 
chains is not fully understood. Recent arguments, suggest that it could arise because burial 
of hydrophobic residues and accommodation of the hydrophilic ones at the surface of proteins 
in water requires some work and perhaps this assembly happens in a discontinuous manner. 



APPENDIX B: 

In Section (II) we considered a variational-type theory to treat the thermodynamics of 
the random hydrophobic-hydrophilic heteropolymer. Here we describe a limiting behavior of 
the random bond model ||29|| . In this appendix we show that the random bond model in the 



the compact phase is identical to the Random Energy Model (REM). Historically, REM was 
proposed as caricature for proteins on phenomenological grounds ||12|| . The heteropolymer 



with random bond interactions was treated using a variational theory which suggested that 
when the disorder increases beyond a limiting value the chain undergoes a thermodynamic 
glass transition. The nature of this transition is closely related to Potts glasses. 

The random-bond heteropolymer is described by a Hamiltonian similar to Eq. (1) except 
the short range two-body term Vij is taken to be random with a Gaussian distribution. In 
this case a three-body term with a positive value of u 3 is needed to describe the collapsed 
phase. The Hamiltonian is 

H = E(^o + v tj )5{r t - r 3 ) + £ 5( n - r 3 )5{r 3 - r k ) (Bl) 
The distribution of the random couplings is given by 

"M-jh"*^- (B2) 

In the collapse phase the monomer density p — j^g is constant (for large N). Thus, the 
only conformation dependent term in Eq. (Al) comes from the random two-body term. 
Because this term is linear combination of Gaussian variables we expect that its distribution 
is also Gaussian and hence, can be specified by the two moments. Let us calculate the 
correlation EiE 2 between the energies E\ and E 2 of two conformations {r[ } and {r| 2 ' ) } of 
the chain in the collapsed state. The mean square of E\ is 

^F=yE^ (1) -^ 1) ) = y^ (B3) 

which is independent of the collapsed conformation. Similarly, we have 

v 2 

= yE9i 2 2( r - r ') 

r,r' 
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where qi2(r, r') is the overlap between the two conformations. Because 

E<?i 2 (r,r') = iV (B4) 

r,r f 

2 

and since the monomer density is constant we qi2(r,r') = fj- . This implies 



Thus the joint probability is 



E 1 E 2 = —p 2 . (B5) 



hm ~exp(- E { + f 2 ) (B6) 



which is equivalent the behavior in uncorrelated REM. Thus, it is not a surprise that in 
large dimensions (which are captured by variational type treatments) the random bond 
heteropolymer model yields exactly the same result as the REM. 



18 



REFERENCES 



[1] L. Stryer. Biochemistry. W. H. Freeman, 1988. 

[2] G. H. Lorimer. A quantitative assessment of the role of the chaperonin proteins in 

protein folding in vivo. FASEB J, 10:5-9, 1996. 
[3] C. B. Anfinsen. Principles that govern the folding of protein chains. Science, 181:223- 

230, 1973. 

[4] P. T. Lansbury. Evolution of amyloids: What normal protein folding can tell us about 
fibrillogenesis and disease. Proc. Natl. Acad. Sci. (USA), 96:3342-3344, 1999. 

[5] J. N. Onuchic, Z. A. Luthey-Schulten, and P. G. Wolynes. Theory of protein folding: 
An energy landscape perspective. Ann. Rev. Phys. Chem., 48:545-600, 1997. 

[6] K. A. Dill and H. S. Chan. From Levinthal to pathways to funnels. Natur. Struct. Biol, 
4:10-19, 1997. 

[7] D. Thirumalai and D. K. Klimov. Deciphering the timescales and mechanisms of protein 

folding using minimal off-lattice models. Curr. Opin. struct. Biol, 9:197-207, 1999. 
[8] E. Shakhnovich. Protein design: a perspective from simple tractable models. Folding & 

Design, 3:R45-R58, 1998. 
[9] P. G. deGennes. Scaling Concepts in Polymer Physics. Cornell University Press, 1985. 
[10] T. Garel, H. Orland, and D. Thirumalai. Analytical theories of protein folding. In R. El- 

ber, editor, New Developments in Theoretical Studies of protein folding, pages 197-268. 

Singapore, World Scientific, 1996. 
[11] M. Virasaro M. Mezard and G. Parisi. Spin glass theory and beyond. World Scientific, 

1987. 

[12] J. D. Bryngelson and P. G. Wolynes. Spin glasses and the statistical mechanics of protein 

folding. Proc. Natl. Acad. Sci. (USA), 84:7524-7528, 1987. 
[13] K. A. Dill, S. Bromberg, K. Yue, K. M. Fiebig, D. P. Yee, P. D. Thomas, and H. S. 

Chan. Principles of protein folding - a perspective from simple exact models. Protein 

Science, 1995:561-602, 1995. 
[14] W. J. C. Orr. Statistical treatment of polymer solutions at infinte dilution. Tran. Faraday 

Soc, 43:12-27, 1947. 

[15] H. Taketomi, Y. Ueda, and N. Go. Studies on protein folding, unfolding, and fluctuations 
by computer simulation. Int. J. Pept. Protein Res., 7:445-459, 1975. 

[16] H. S. Chan and K. A. Dill. Intrachain loops in polymers:effects of excluded volume. J. 
Chem. Phys., 90:493-509, 1989. 

[17] D. K. Klimov and D. Thirumalai. Factors governing the foldability of proteins. Proteins: 
Struct. Fund. Genet., 26:411-441, 1996. 

[18] D. Thirumalai and D. K. Klimov. Emergence of stable and fast folding protein structure. 
In S. Kim, K. J. Lee, and W. Sung, editors, Stochastic dynamics and Pattern Forma- 
tion in Biological and Complex Systems, pages 95-111. Melville, New York, American 
Institue of Physics, 2000. 

[19] H. Li, N. Winfreen, and C. Tang. Emergence of preferred structures in a simple model 
of protein folding. Science, 273:666-669, 1996. 

[20] Per-Anker Lindgard and H. Bohr. Magic numbers in protein structures. Phys. Rev. 
Letters, 77:779-782, 1996. 

[21] P. G. Wolynes. Symmetry and the energy landscape of bio molecules. Proc. Natl. Acad. 
Sci. (USA), 93:14249-14255, 1996. 



19 



[22] P. G. Wolynes. Folding nucleus and energy landscapes of larger proteins within the 

capillarity approximation. Proc. Natl. Acad. Sci. USA, 94:6170-6175, 1997. 
[23] J.H. Cate, A.R. Gooding, E. Podell, K. Zhou, B.L. Golden, C.E. Kundrot, T.R. Cech, 

and J. A. Doudna. Crystal structure of a group i ribozyme domain: Principles of rna 

packing. Science, 273:1678-1685, 1996. 
[24] Z. Guo and D. Thirumalai. Kinetics of protein folding: Nucleation mechanism, time 

scales and pathways. Biopolymers, 36:83-103, 1995. 
[25] E. I. Shakhnovich, V. Abkevich, and E. I. Shakhnovich. Conserved residues and the 

mechanism of protein folding. Nature, 379:96-98, 1996. 
[26] J. Wang and W. Wang. A computational approach to simplifying the protein folding 

alphabet. Nat. Struct. Bio., 6:1033-1038, 1999. 
[27] W.A. Eaton, V. Munoz, P.A. Thompson, E.R. Henry, and J. Hofrichter. Kinetics and 

dynamics of loops, a- helices, /3-hairpins, and fast-folding proteins. Acc. Chem. Res., 

31:745-753, 1998. 

[28] T.E. Fisher, A.F. Oberhauser, M. Carrion- Vazquez, P.E. Marszalek, and J.M. Fer- 
nandez. The study of protein mechanics with the atomic force microscope. Trends in 
Biochemical Sciences, 24:379-384, 1999. 

[29] E. Shakhnovich and A. Gutin. Formation of unique structure in polypeptide chains, the- 
oretical investigation with the aid of replica approach. Biophysical Chemistry, 34:187- 
199, 1989. 

[30] R. Sayle and E.J. Milner- White. Rasmol: Biomolecular graphics for all. Trends in Bio- 
chemical Sciences, 20:374-376, 1995. 



20 



FIGURES 



Fig. (1) The 3D native structure of hemoglobin visualized using RasMol 2.6 |30[. The 
linear sequence of amino acids of hemoglobin is given below the figure. 
Fig. (2) Scaling of the number of MES C(MES) (squares) is shown for the hydrophobic 
parameter Bq = —0.1 and A = 0.6. Data are obtained for the cubic lattice. The pairs of 
squares for each N represent the quenched averages for different samples of 30 sequences. 
The number of compact structures C(CS) and self-avoiding conformations C(SAW) are also 
displayed to underscore the dramatic difference of scaling behavior of C(MES) and C(CS) 
(or C(SAW)). It is clear that C(MES) remains practically flat, i.e. it grows no faster than 
InN. 

Fig. (3) Histogram of number of structures with a given number of associated sequences 
N s for 3D 3x3x3 case, in a log-log plot. 

Fig. (4) Schematic illustration of the stages in the drastic reduction of sequence space in 
the process of evolution to functionally competent protein structures. 

Fig. (5) Native structure of acyl-coenzyme A binding protein (first NMR structure out of 
29 deposited to PDB). The figure was created using RasMol 2.6 []50| . 

Fig. (6) The native conformation of fast folding sequence (N = 27) with random bond 
potentials is shown. This structure has c = 22 non-bonded contacts, therefore it is not a 
maximally compact conformation for which c = 28. The figure was created using RasMol 
2.6 II. 

Fig. (7) (a) Thermodynamic functions computed for the sequence whose native state is 
shown in Fig. (3). (a) Specific heat C v (dotted line) and derivative of the radius of gyra- 
tion with respect to temperature dR g /dT (dashed line) as a function of temperature. The 
collapse temperature T e is determined from the peak of C v and found to be 0.83. T e is 
very close to the temperature, at which dR g /dT becomes maximum (0.86). This illustrates 
that Tq is indeed associated with the compaction of the chain. Temperature dependence of 
fluctuations of overlap function Ax is given by solid line. The folding transition temperature 
Tp is obtained from the peak of Ax and for this sequence T F = 0.79. The curves are scaled 
to fit one plot. 

(b) Time dependence of the fraction of unfolded molecules P u (t) for the sequence 74 calcu- 
lated at folding conditions T s < Tp. The function P u (t) is computed from a distribution 
of first passage times Tu. First passage time for a given initial condition is the first time 
the trajectory reaches the native conformation. Typically an adequately converged distri- 
bution is obtained by averaging over several hundred initial conditions. For the conditions 
used in this simulation folding is two-state, therefore, P u {t) is adequately fit with the single 
exponential (thick solid line). The folding time Tp obtained from the fit is 1.4 x 10 G MCS. 
Fig. (8) Plot of the folding times Tp as a function of a F for the 22 sequences. This figures 
shows that under the external conditions when the NBA is the most populated there is a 
remarkable correlation between Tp and a F - The correlation coefficient is 0.94. It is clear 
that over a four orders of magnitude of folding times Tp ~ exp(— <tt/co) where <To is a 
constant. In both panels the filled and open circles are for the RB and KGS 27-mer models, 
respectively. The open squares are for N = 36. 

Fig. (9) Examples of folding trajectories at T = T s derived from the condition < x{T s ) >— 
0.21. (a) Fast folding trajectory as monitored by x{t)- It is seen that sequence reaches the 
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native state very rapidly in a two-state manner without being trapped in intermediates. 
The first passage time for this trajectory is 277,912 MCS. (b) Slow folding trajectory for the 
same sequence. The sequence becomes trapped in several intermediate states with large x 
en route to the native state. The first passage time is 11,442,793 MCS. Notice that the time 
scales in both panels are dramatically different. 

Fig. (10) The figure gives the foldability a of 27-mer lattice chains with sets containing 
different number of amino acids. The sets are generated according to scheme described 
in [26]. The set of 20 amino acids is taken as a standard sample. Each sequence with 



20 amino acids is optimized to fulfill the stability gap g. The residues in the standard 
samples are substituted with four different sets containing smaller number of amino acids 
|2(|. The foldability of these substitution are indicated by the solid circles. The open 



diamonds correspond to the sequences with same composition. However, the amino acids 
are chosen from the reduced representation and the result sequence is optimized using the 
stability gap H. 
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Fig. (2) 
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Fig. (4) 
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Fig. (6) 
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